{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "positional_encoding.ipynb",
      "provenance": [],
      "collapsed_sections": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "7fjcTlyE3WvR"
      },
      "source": [
        "#A Positional Encoding Example\n",
        "Copyright 2021 Denis Rothman, MIT License\n",
        "\n",
        "Reference 1 for Positional Encoding:\n",
        "Attention is All You Need paper, page 6,Google Brain and Google Research\n",
        "\n",
        "\n",
        "Reference 2 for word embedding:\n",
        "https://www.geeksforgeeks.org/python-word-embedding-using-word2vec/\n",
        "Reference 3 for cosine similarity:\n",
        "SciKit Learn cosine similarity documentation\n"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "JKJ8Saf6vR9b",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "654ce4ae-0115-46d4-a186-ee2581f1ee4f"
      },
      "source": [
        "!pip install gensim==3.8.3\n",
        "import torch\n",
        "import nltk\n",
        "nltk.download('punkt')"
      ],
      "execution_count": 6,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Requirement already satisfied: gensim==3.8.3 in /usr/local/lib/python3.7/dist-packages (3.8.3)\n",
            "Requirement already satisfied: scipy>=0.18.1 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (1.4.1)\n",
            "Requirement already satisfied: smart-open>=1.8.1 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (4.2.0)\n",
            "Requirement already satisfied: numpy>=1.11.3 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (1.19.5)\n",
            "Requirement already satisfied: six>=1.5.0 in /usr/local/lib/python3.7/dist-packages (from gensim==3.8.3) (1.15.0)\n",
            "[nltk_data] Downloading package punkt to /root/nltk_data...\n",
            "[nltk_data]   Package punkt is already up-to-date!\n"
          ],
          "name": "stdout"
        },
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "True"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 6
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "PGXgeOyS5qBP"
      },
      "source": [
        "# upload to the text.txt file to Google Colaboratory using the file manager "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "7o7EeDUUu0Sh"
      },
      "source": [
        "import math\n",
        "import numpy as np\n",
        "from nltk.tokenize import sent_tokenize, word_tokenize \n",
        "import gensim \n",
        "from gensim.models import Word2Vec \n",
        "import numpy as np\n",
        "from sklearn.metrics.pairwise import cosine_similarity\n",
        "import matplotlib.pyplot as plt\n",
        "import warnings \n",
        "warnings.filterwarnings(action = 'ignore') \n",
        "\n",
        "\n",
        "dprint=0 # prints outputs if set to 1, default=0\n",
        "\n",
        "#‘text.txt’ file \n",
        "sample = open(\"text.txt\", \"r\") \n",
        "s = sample.read() \n",
        "\n",
        "# processing escape characters \n",
        "f = s.replace(\"\\n\", \" \") \n",
        "\n",
        "data = [] \n",
        "\n",
        "# sentence parsing \n",
        "for i in sent_tokenize(f): \n",
        "\ttemp = [] \n",
        "\t# tokenize the sentence into words \n",
        "\tfor j in word_tokenize(i): \n",
        "\t\ttemp.append(j.lower()) \n",
        "\tdata.append(temp) \n",
        "\n",
        "# Creating Skip Gram model \n",
        "#model2 = gensim.models.Word2Vec(data, min_count = 1, size = 512,window = 5, sg = 1) \n",
        "#model = Word2Vec(sentences=common_texts, vector_size=100, window=5, min_count=1, workers=4)\n",
        "model2 = gensim.models.Word2Vec(data, min_count = 1, size = 512,window = 5, sg = 1)\n",
        "\n",
        "# 1-The 2-black 3-cat 4-sat 5-on 6-the 7-couch 8-and 9-the 10-brown 11-dog 12-slept 13-on 14-the 15-rug.\n",
        "word1='black'\n",
        "word2='brown'\n",
        "pos1=2\n",
        "pos2=10\n",
        "a=model2[word1]\n",
        "b=model2[word2]\n",
        "\n",
        "if(dprint==1):\n",
        "        print(a)\n",
        "\n",
        "# compute cosine similarity\n",
        "dot = np.dot(a, b)\n",
        "norma = np.linalg.norm(a)\n",
        "normb = np.linalg.norm(b)\n",
        "cos = dot / (norma * normb)\n",
        "\n",
        "aa = a.reshape(1,512) \n",
        "ba = b.reshape(1,512)\n",
        "cos_lib = cosine_similarity(aa, ba)"
      ],
      "execution_count": 7,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "xlTeXmatz7bP"
      },
      "source": [
        "A Positional Encoding example using one line of basic Python using a few lines of code for the sine and cosine functions.\n",
        "I added a Pytorch method inspired by Pytorch.org to explore these methods.\n",
        "The main idea to keep in mind is that we are looking to add small values to the word embedding output so that the positions are taken into account. This means that as long as the cosine similarity, for example, displayed at the end of the notebook, shows the positions have been taken into account, the method can apply. Depending on the Transformer model, this method can be fine-tuned as well as using other methods.  "
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EmBUq9MzxQxz"
      },
      "source": [
        "pe1=aa.copy()\n",
        "pe2=aa.copy()\n",
        "pe3=aa.copy()\n",
        "paa=aa.copy()\n",
        "pba=ba.copy()\n",
        "d_model=512\n",
        "max_print=d_model\n",
        "max_length=20\n",
        "\n",
        "for i in range(0, max_print,2):\n",
        "                pe1[0][i] = math.sin(pos1 / (10000 ** ((2 * i)/d_model)))\n",
        "                paa[0][i] = (paa[0][i]*math.sqrt(d_model))+ pe1[0][i]\n",
        "                pe1[0][i+1] = math.cos(pos1 / (10000 ** ((2 * i)/d_model)))\n",
        "                paa[0][i+1] = (paa[0][i+1]*math.sqrt(d_model))+pe1[0][i+1]\n",
        "                if dprint==1:\n",
        "                        print(i,pe1[0][i],i+1,pe1[0][i+1])\n",
        "                        print(i,paa[0][i],i+1,paa[0][i+1])\n",
        "                        print(\"\\n\")\n",
        "\n",
        "#print(pe1)\n",
        "# A  method in Pytorch using torch.exp and math.log :\n",
        "max_len=max_length                \n",
        "pe = torch.zeros(max_len, d_model)\n",
        "position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)\n",
        "div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))\n",
        "pe[:, 0::2] = torch.sin(position * div_term)\n",
        "pe[:, 1::2] = torch.cos(position * div_term)\n",
        "#print(pe[:, 0::2])"
      ],
      "execution_count": 8,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "pgrXed2FwHDC",
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "outputId": "54dcdded-8470-47fc-999e-67294ee67dd2"
      },
      "source": [
        "\n",
        "for i in range(0, max_print,2):\n",
        "                pe2[0][i] = math.sin(pos2 / (10000 ** ((2 * i)/d_model)))\n",
        "                pba[0][i] = (pba[0][i]*math.sqrt(d_model))+ pe2[0][i]\n",
        "            \n",
        "                pe2[0][i+1] = math.cos(pos2 / (10000 ** ((2 * i)/d_model)))\n",
        "                pba[0][i+1] = (pba[0][i+1]*math.sqrt(d_model))+ pe2[0][i+1]\n",
        "               \n",
        "                if dprint==1:\n",
        "                        print(i,pe2[0][i],i+1,pe2[0][i+1])\n",
        "                        print(i,paa[0][i],i+1,paa[0][i+1])\n",
        "                        print(\"\\n\")\n",
        "\n",
        "print(word1,word2)\n",
        "cos_lib = cosine_similarity(aa, ba)\n",
        "print(cos_lib,\"word similarity\")\n",
        "cos_lib = cosine_similarity(pe1, pe2)\n",
        "print(cos_lib,\"positional similarity\")\n",
        "cos_lib = cosine_similarity(paa, pba)\n",
        "print(cos_lib,\"positional encoding similarity\")\n",
        "\n",
        "if dprint==1:\n",
        "        print(word1)\n",
        "        print(\"embedding\")\n",
        "        print(aa)\n",
        "        print(\"positional encoding\")\n",
        "        print(pe1)\n",
        "        print(\"encoded embedding\")\n",
        "        print(paa)\n",
        "\n",
        "        print(word2)\n",
        "        print(\"embedding\")\n",
        "        print(ba)\n",
        "        print(\"positional encoding\")\n",
        "        print(pe2)\n",
        "        print(\"encoded embedding\")\n",
        "        print(pba)\n",
        "\n"
      ],
      "execution_count": 9,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "black brown\n",
            "[[0.9998703]] word similarity\n",
            "[[0.8600013]] positional similarity\n",
            "[[0.96135795]] positional encoding similarity\n"
          ],
          "name": "stdout"
        }
      ]
    }
  ]
}